Abstract

The Shannon entropy is a widely used summary statistic, with applications in,
for example, network traffic measurement, anomaly detection, neural
computations, and spike trains. This study focuses on estimating the Shannon
entropy of data streams. It is known that Shannon entropy can be approximated
by Rényi entropy or Tsallis entropy, which are both functions of the $\alpha$th
frequency moments and approach Shannon entropy as $\alpha \to 1$. Compressed
Counting (CC) [24] is a new method for approximating the $\alpha$th frequency
moments of data streams. Our contributions include:

• We prove that Rényi entropy is (much) better than Tsallis entropy for
approximating Shannon entropy.

• We propose the optimal quantile estimator for CC, which considerably
improves the estimators in [24].

• Our experiments demonstrate that CC is indeed highly effective in
approximating the moments and entropies. We also demonstrate the crucial
importance of utilizing the variance-bias trade-off.

1 Introduction 

The problem of "scaling up for high dimensional data and high speed data
streams" is among the "ten challenging problems in data mining research" [32].
This paper is devoted to estimating entropy of data streams using a recent
algorithm called Compressed Counting (CC) [24]. This work has four components:
(1) the theoretical analysis of entropies, (2) a much improved estimator for
CC, (3) the bias and variance in estimating entropy, and (4) an empirical study
using Web crawl data.



1.1 Relaxed Strict-Turnstile Data Stream Model

While traditional data mining algorithms often assume static data, in reality,
data are often constantly updated. Mining data streams [18, 4, 1, 27] in
databases at the scale of (e.g.) 100 TB has become an important area of
research, e.g., [10, 1], as network data can easily reach that scale [32].
Search engines are a typical source of data streams [4].

¹Even earlier versions of this paper were submitted in 2008.

We consider the Turnstile stream model [27]. The input stream
$a_t = (i_t, I_t)$, $i_t \in [1, D]$, arriving sequentially describes the
underlying signal $A$, meaning

$$A_t[i_t] = A_{t-1}[i_t] + I_t, \qquad (1)$$

where the increment $I_t$ can be either positive (insertion) or negative
(deletion). For example, in an online store, $A_{t-1}[i]$ may record the total
number of items that user $i$ has ordered up to time $t - 1$ and $I_t$ denotes
the number of items that this user orders ($I_t > 0$) or cancels ($I_t < 0$) at
$t$. If each user is identified by the IP address, then potentially
$D = 2^{64}$. It is often reasonable to assume $A_t[i] \geq 0$, although $I_t$
may be either negative or positive. Restricting $A_t[i] \geq 0$ results in the
strict-Turnstile model, which suffices for describing almost all natural
phenomena. For example, in an online store, it is not possible to cancel orders
that do not exist.

Compressed Counting (CC) assumes a relaxed strict-Turnstile model by only
enforcing $A_t[i] \geq 0$ at the $t$ one cares about. At other times
$s \neq t$, $A_s[i]$ can be arbitrary.
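The update rule (1) can be illustrated with a minimal Python sketch. The function name `apply_stream` and the example stream are hypothetical and only serve the illustration; the sketch keeps the full signal in a dictionary, which is exactly what a streaming algorithm tries to avoid, and checks non-negativity only at the time of evaluation, as in the relaxed strict-Turnstile model.

```python
from collections import defaultdict

def apply_stream(updates):
    """Apply Turnstile updates (i_t, I_t) and return (A, running_sum).

    A[i] accumulates the increments of item i: A_t[i_t] = A_{t-1}[i_t] + I_t.
    running_sum is a single counter holding the sum of all increments.
    """
    A = defaultdict(int)
    running_sum = 0
    for i_t, I_t in updates:
        A[i_t] += I_t        # insertion if I_t > 0, deletion if I_t < 0
        running_sum += I_t   # one counter is enough for the sum of the stream
    return dict(A), running_sum

# Hypothetical stream: user 7 orders 3 items, user 2 orders 5, user 7 cancels 1.
updates = [(7, 3), (2, 5), (7, -1)]
A, total = apply_stream(updates)

# Relaxed strict-Turnstile: non-negativity is only checked at evaluation time.
non_negative_now = all(v >= 0 for v in A.values())
print(A, total, non_negative_now)
```

The single counter `total` agrees with the sum of the entries of `A`, which is why the first moment is trivial to maintain in this model.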

1.2 Moments and Entropies of Data Streams 

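The $\alpha$th frequency moment is a fundamental statistic:

$$F_{(\alpha)} = \sum_{i=1}^{D} A_t[i]^{\alpha}. \qquad (2)$$

When $\alpha = 1$, $F_{(1)}$ is the sum of the stream. It is obvious that one
can compute $F_{(1)}$ exactly and trivially using a simple counter, because
$F_{(1)} = \sum_{i=1}^{D} A_t[i] = \sum_{s=0}^{t} I_s$.

$A_t$ is basically a histogram and we can view

$$p_i = \frac{A_t[i]}{\sum_{i=1}^{D} A_t[i]} = \frac{A_t[i]}{F_{(1)}} \qquad (3)$$

as probabilities. A useful (e.g., in Web and networks [12, 21, 33, 26] and
neural computations [28]) summary statistic is Shannon entropy

$$H = -\sum_{i=1}^{D} \frac{A_t[i]}{F_{(1)}} \log \frac{A_t[i]}{F_{(1)}} = -\sum_{i=1}^{D} p_i \log p_i. \qquad (4)$$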

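Various generalizations of the Shannon entropy exist. The Rényi entropy [29],
denoted by $H_\alpha$, is defined as

$$H_\alpha = \frac{1}{1-\alpha} \log \frac{\sum_{i=1}^{D} A_t[i]^{\alpha}}{\left(\sum_{i=1}^{D} A_t[i]\right)^{\alpha}} = -\frac{1}{\alpha-1} \log \sum_{i=1}^{D} p_i^{\alpha}. \qquad (5)$$

The Tsallis entropy [17, 30], denoted by $T_\alpha$, is defined as

$$T_\alpha = \frac{1}{\alpha-1} \left(1 - \frac{F_{(\alpha)}}{F_{(1)}^{\alpha}}\right) = \frac{1 - \sum_{i=1}^{D} p_i^{\alpha}}{\alpha-1}. \qquad (6)$$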

As $\alpha \to 1$, both Rényi entropy and Tsallis entropy converge to Shannon
entropy:

$$\lim_{\alpha \to 1} H_\alpha = \lim_{\alpha \to 1} T_\alpha = H.$$

Thus, both Rényi entropy and Tsallis entropy can be computed from the
$\alpha$th frequency moment; and one can approximate Shannon entropy from
either $H_\alpha$ or $T_\alpha$ by using $\alpha \approx 1$. Several studies
(e.g., [33]) used this idea to approximate Shannon entropy.
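As a small numerical illustration of definitions (2) to (6), the following Python sketch computes the three entropies from a histogram and prints how $H_\alpha$ and $T_\alpha$ approach $H$ as $\alpha \to 1$. The function names and the histogram `counts` are hypothetical choices made for this example only.

```python
import math

def frequency_moment(counts, alpha):
    """F_(alpha) = sum_i A_t[i]^alpha over the non-zero entries."""
    return sum(c ** alpha for c in counts if c > 0)

def shannon(counts):
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

def renyi(counts, alpha):
    f1 = sum(counts)
    return math.log(frequency_moment(counts, alpha) / f1 ** alpha) / (1.0 - alpha)

def tsallis(counts, alpha):
    f1 = sum(counts)
    return (1.0 - frequency_moment(counts, alpha) / f1 ** alpha) / (alpha - 1.0)

counts = [50, 30, 10, 5, 3, 1, 1]   # hypothetical histogram A_t[i]
H = shannon(counts)
for alpha in (0.9, 0.99, 0.999, 1.001, 1.01, 1.1):
    print(alpha, renyi(counts, alpha), tsallis(counts, alpha), H)
```

Both approximations only need $F_{(\alpha)}$ and $F_{(1)}$, which is what makes moment estimation relevant for entropy estimation.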
1.3 Sample Applications of Shannon Entropy

1.3.1 Real-Time Network Anomaly Detection 

Network traffic is a typical example of high-rate data streams. An effective
and reliable measurement of network traffic in real-time is crucial for anomaly
detection and network diagnosis; and one such measurement metric is Shannon
entropy [12, 20, 31, 6, 21, 33]. The Turnstile data stream model (1) is
naturally suitable for describing network traffic, especially when the goal is
to characterize the statistical distribution of the traffic. In its empirical
form, a statistical distribution is described by histograms, $A_t[i]$, $i = 1$
to $D$. It is possible that $D = 2^{64}$ (IPv6) if one is interested in
measuring the traffic streams of unique sources or destinations.

The Distributed Denial of Service (DDoS) attack is a representative example of
network anomalies. A DDoS attack attempts to make computers unavailable to
intended users, either by forcing users to reset the computers or by exhausting
the resources of service-hosting sites. For example, hackers may maliciously
saturate the victim machines by sending many external communication requests.
DDoS attacks typically target sites such as banks, credit card payment
gateways, or military sites.

A DDoS attack changes the statistical distribution of network traffic.
Therefore, a common practice to detect an attack is to monitor the network
traffic using certain summary statistics. Since Shannon entropy is well suited
for characterizing a distribution, a popular detection method is to measure the
time-history of entropy and raise an alarm when the entropy becomes abnormal
[12, 21].
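The detection idea described above can be sketched as follows. This is only an illustration under simplifying assumptions: the entropy of each time window is computed exactly (not estimated from a sketch), and the baseline, the tolerance, and the traffic windows are hypothetical values chosen for the example.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Exact Shannon entropy of the empirical distribution of items."""
    counts = Counter(items)
    n = len(items)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def entropy_alarms(windows, baseline, tolerance):
    """Indices of windows whose entropy deviates from baseline by more than tolerance."""
    return [t for t, w in enumerate(windows)
            if abs(shannon_entropy(w) - baseline) > tolerance]

# Hypothetical traffic: destination addresses seen in three consecutive windows.
normal = ["a", "b", "c", "d"] * 25            # evenly spread, entropy = log 4
attack = ["victim"] * 90 + ["a", "b"] * 5     # concentrated on one destination
windows = [normal, attack, normal]
alarms = entropy_alarms(windows, baseline=math.log(4), tolerance=0.5)
print(alarms)
```

In a streaming setting the exact entropy in `shannon_entropy` would be replaced by an estimate, which is the subject of the rest of this paper.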

Entropy measurements do not have to be "perfect" for detecting attacks. It is
however crucial that the algorithm be computationally efficient at low memory
cost, because the traffic data generated by large high-speed networks are
enormous and transient (e.g., 1 Gbit/second). Algorithms should be real-time
and one-pass, as the traffic data will not be stored [4]. Many algorithms have
been proposed for "sampling" the traffic data and estimating entropy over data
streams [21, 33, 5, 14, 3, 7, 16, 15].

1.3.2 Anomaly Detection by Measuring OD Flows 

In high-speed networks, anomaly events including network failures and DDoS
attacks may not always be detected by simply monitoring the traditional traffic
matrix because the change of the total traffic volume is sometimes small. One
strategy is to measure the entropies of all origin-destination (OD) flows [33].
An OD flow is the traffic entering an ingress point (origin) and exiting at an
egress point (destination).

[33] showed that measuring entropies of OD flows involves measuring the
intersection of two data streams, whose moments can be decomposed into the
moments of individual data streams (to which CC is applicable) and the moments
of the absolute difference between two data streams.

1.3.3 Entropy of Query Logs in Web Search 

The recent work [26] was devoted to estimating the Shannon entropy of MSN
search logs, to help answer some basic problems in Web search, such as, how big
is the web?

The search logs can be viewed as data streams, and [26] analyzed several
"snapshots" of a sample of MSN search logs. The sample used in [26] contained
10 million <Query, URL, IP> triples; each triple corresponded to a click from a
particular IP address on a particular URL for a particular query. [26] drew
their important conclusions on this (hopefully) representative sample.
Alternatively, one could apply data stream algorithms such as CC on the whole
history of MSN (or other search engines).

1.3.4 Entropy in Neural Computations 

A workshop at NIPS '03 was devoted to entropy estimation, owing to the
widespread use of Shannon entropy in neural computations [28]
(http://www.menem.com/~ilya/pages/NIPS03). For example, one application of
entropy is to study the underlying structure of spike trains.

1.4 Related Work 

Because the elements, $A_t[i]$, are time-varying, a naive counting mechanism
requires a system of $D$ counters to compute $F_{(\alpha)}$ exactly (unless
$\alpha = 1$). This is not always realistic. Estimating $F_{(\alpha)}$ in data
streams is heavily studied [2, 11, 13, 19, 23]. We have mentioned that
computing $F_{(1)}$ in the strict-Turnstile model is trivial using a simple
counter. One might naturally speculate that when $\alpha \approx 1$, computing
(approximating) $F_{(\alpha)}$ should also be easy. However, before Compressed
Counting (CC), none of the prior algorithms could capture this intuition.

CC improves symmetric stable random projections [19, 23] uniformly for all
$0 < \alpha < 2$ as shown in Figure 3 in Section 5. However, one can still
considerably improve CC around $\alpha = 1$, by developing better estimators,
as in this study. In addition, no empirical studies on CC were reported.

[33] applied symmetric stable random projections to approximate the moments and
Shannon entropy. The nice theoretical work [16, 15] provided the criterion to
choose the $\alpha$ so that Shannon entropy can be approximated with a
guaranteed accuracy, using the $\alpha$th frequency moment.

1.5 Summary of Our Contributions 

• We prove that using Rényi entropy to estimate Shannon entropy has (much)
smaller bias than using Tsallis entropy. When data follow a common Zipf
distribution, the difference could be an order of magnitude.

• We provide a much improved estimator. CC boils down to a statistical
estimation problem. The new estimator based on optimal quantiles exhibits
considerably smaller variance when $\alpha \approx 1$, compared to [24].

• We demonstrate the bias-variance trade-off in estimating Shannon entropy,
which is important for choosing the sample size and how small $|\alpha - 1|$
should be.

• We supply an empirical study.

1.6 Organization 

Section 2 illustrates what entropies are like in real data. Section 3 includes
some theoretical studies of entropy. The methodologies of CC and two estimators
are reviewed in Section 4. The new estimator based on the optimal quantiles is
presented in Section 5. We analyze in Section 6 the biases and variances in
estimating entropies. Experiments on real Web crawl data are presented in
Section 7.

2 The Data Set 

Since the estimation accuracy is what we are interested in, we can simply use
static data instead of real data streams. This is because at the time $t$,
$F_{(\alpha)} = \sum_{i=1}^{D} A_t[i]^{\alpha}$ is the same at the end of the
stream, regardless of whether it is collected at once (i.e., static) or
incrementally (i.e., dynamic).

Ten English words are selected from a chunk of Web crawl data with
$D = 2^{16} = 65536$ pages. The words are selected fairly randomly, except that
we make sure they cover a whole range of sparsity, from function words (e.g.,
A, THE), to common words (e.g., FRIDAY) to rare words (e.g., TWIST). The data
are summarized in Table 1.



Table 1: The data set consists of 10 English words selected from a chunk of
$D = 65536$ Web pages, forming 10 vectors of length $D$ whose values are the
word occurrences. The table lists their numbers of non-zeros (sparsity), $H$,
$H_\alpha$, and $T_\alpha$ (for $\alpha = 0.95$ and $1.05$).

Word       Nonzero   H        H_0.95   H_1.05   T_0.95   T_1.05
TWIST      274       5.4873   5.4962   5.4781   6.3256   4.7919
RICE       490       5.4474   5.4997   5.3937   6.3302   4.7276
FRIDAY     2237      7.0487   7.1039   6.9901   8.5292   5.8993
FUN        3076      7.6519   7.6821   7.6196   9.3660   6.3361
BUSINESS   8284      8.3995   8.4412   8.3566   10.502   6.8305
NAME       9423      8.5162   9.5677   8.4618   10.696   6.8996
HAVE       17522     8.9782   9.0228   8.9335   11.402   7.2050
THIS       27695     9.3893   9.4370   9.3416   12.059   7.4634
A          39063     9.5463   9.5981   9.4950   12.318   7.5592
THE        42754     9.4231   9.4828   9.3641   12.133   7.4775


Figure 1: Two words are selected for comparing three entropies (curves:
Tsallis, Rényi, Shannon). The Shannon entropy is a constant (horizontal line).

Figure 1 selects two words to compare their Shannon entropies $H$, Rényi
entropies $H_\alpha$, and Tsallis entropies $T_\alpha$. Clearly, although both
approach Shannon entropy, Rényi entropy is much more accurate than Tsallis
entropy.

3 Theoretical Analysis of Entropy 

This section presents two Lemmas, proved in the Appendix. Lemma 1 says Rényi
entropy has smaller bias than Tsallis entropy for estimating Shannon entropy.

Lemma 1

$$|H_\alpha - H| \leq |T_\alpha - H|. \qquad (7)$$

Lemma 1 does not say precisely how much better. Note that when
$\alpha \to 1$, the magnitudes of $|H_\alpha - H|$ and $|T_\alpha - H|$ are
largely determined by the first derivatives (slopes) of $H_\alpha$ and
$T_\alpha$, respectively, evaluated at $\alpha \to 1$. Lemma 2 directly
compares their first and second derivatives, as $\alpha \to 1$.

Lemma 2 As $\alpha \to 1$,

$$T'_\alpha \to -\frac{1}{2} \sum_{i=1}^{D} p_i \log^2 p_i, \qquad T''_\alpha \to -\frac{1}{3} \sum_{i=1}^{D} p_i \log^3 p_i,$$

$$H'_\alpha \to \frac{1}{2} \left(\sum_{i=1}^{D} p_i \log p_i\right)^2 - \frac{1}{2} \sum_{i=1}^{D} p_i \log^2 p_i,$$

$$H''_\alpha \to \sum_{i=1}^{D} p_i \log p_i \sum_{i=1}^{D} p_i \log^2 p_i - \frac{2}{3} \left(\sum_{i=1}^{D} p_i \log p_i\right)^3 - \frac{1}{3} \sum_{i=1}^{D} p_i \log^3 p_i.$$

Lemma 2 shows that in the limit, $|H'_1| \leq |T'_1|$, verifying that
$H_\alpha$ should have smaller bias than $T_\alpha$. Also,
$|H''_1| \leq |T''_1|$. Two special cases are interesting.



3.1 Uniform Data Distribution 

In this case, $p_i = \frac{1}{D}$ for all $i$. It is easy to show that
$H_\alpha = H$ regardless of $\alpha$. Thus, when the data distribution is close
to being uniform, Rényi entropy will provide nearly perfect estimates of
Shannon entropy.
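The claim for the uniform case can be verified in a few lines from definitions (4), (5), and (6). The worked derivation below is added for illustration; the last line shows that Tsallis entropy, in contrast, does not coincide with $H$ for $\alpha \neq 1$ and $D > 1$.

```latex
\begin{align*}
\sum_{i=1}^{D} p_i^{\alpha} &= D \cdot D^{-\alpha} = D^{1-\alpha}, \\
H_\alpha &= \frac{1}{1-\alpha} \log D^{1-\alpha} = \log D, \\
H &= -\sum_{i=1}^{D} \frac{1}{D} \log \frac{1}{D} = \log D, \\
T_\alpha &= \frac{1 - D^{1-\alpha}}{\alpha - 1} \neq \log D
  \quad (\alpha \neq 1,\ D > 1).
\end{align*}
```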

3.2 Zipf Data Distribution 

In Web and NLP applications [25], the Zipf distribution is common:
$p_i \propto \frac{1}{i^{\gamma}}$.

Figure 2: Ratios of the two first derivatives, for
$D = 10^3, 10^4, 10^5, 10^6, 10^7$, plotted against the Zipf exponent
($\gamma$). The curves largely overlap and hence we do not label the curves.

Figure 2 plots the ratio,
$\frac{\lim_{\alpha \to 1} T'_\alpha}{\lim_{\alpha \to 1} H'_\alpha}$. At
$\gamma \approx 1$ (which is common [25]), the ratio is about 10, meaning that
the bias of Rényi entropy could be an order of magnitude smaller than that of
Tsallis entropy, in common data sets.
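The two Lemmas can be checked numerically on a Zipf distribution. The Python sketch below is illustrative only: the function names, $D = 1000$, and $\gamma = 1$ are choices made for this example, and the limits of the first derivatives are taken from Lemma 2.

```python
import math

def zipf(D, gamma):
    """Probabilities p_i proportional to 1 / i^gamma, i = 1..D."""
    w = [i ** -gamma for i in range(1, D + 1)]
    z = sum(w)
    return [x / z for x in w]

def first_derivative_limits(p):
    """Limits of T'_alpha and H'_alpha as alpha -> 1, as stated in Lemma 2."""
    m1 = sum(x * math.log(x) for x in p)
    m2 = sum(x * math.log(x) ** 2 for x in p)
    return -0.5 * m2, 0.5 * m1 ** 2 - 0.5 * m2

def entropies(p, alpha):
    """Return (H, H_alpha, T_alpha) for a probability vector p."""
    s = sum(x ** alpha for x in p)
    H = -sum(x * math.log(x) for x in p)
    return H, -math.log(s) / (alpha - 1.0), (1.0 - s) / (alpha - 1.0)

p = zipf(1000, 1.0)
dT, dH = first_derivative_limits(p)
ratio = dT / dH                       # the quantity plotted in Figure 2
H, H_a, T_a = entropies(p, 0.95)
print(ratio, abs(H_a - H), abs(T_a - H))
```

The printed intrinsic biases at $\alpha = 0.95$ should satisfy Lemma 1, and `ratio` indicates how much steeper $T_\alpha$ is than $H_\alpha$ near $\alpha = 1$ for this particular distribution.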
4 Review of Compressed Counting (CC)

Compressed Counting (CC) assumes the relaxed strict-Turnstile data stream model. Its 
underlying technique is based on maximally-skewed stable random projections. 

4.1 Maximally-Skewed Stable Distributions 

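A random variable $Z$ follows a maximally-skewed $\alpha$-stable distribution if
the Fourier transform of its density is [34]

$$\mathcal{F}_Z(t) = E \exp\left(\sqrt{-1}\, Z t\right) = \exp\left(-F |t|^{\alpha} \left(1 - \sqrt{-1}\, \beta\, \mathrm{sign}(t) \tan\left(\frac{\pi \alpha}{2}\right)\right)\right), \quad \alpha \neq 1,$$

where $0 < \alpha < 2$, $F > 0$, and $\beta = 1$. We denote
$Z \sim S(\alpha, \beta = 1, F)$. The skewness parameter $\beta$ for general
stable distributions ranges in $[-1, 1]$; but CC uses $\beta = 1$, i.e.,
maximally-skewed. Previously, the method of symmetric stable random projections
[19, 23] used $\beta = 0$.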

Consider two independent variables, $Z_1, Z_2 \sim S(\alpha, \beta = 1, 1)$. For
any non-negative constants $C_1$ and $C_2$, the "$\alpha$-stability" follows
from properties of Fourier transforms:

$$Z = C_1 Z_1 + C_2 Z_2 \sim S\left(\alpha, \beta = 1, C_1^{\alpha} + C_2^{\alpha}\right).$$

Note that if $\beta = 0$, then the above stability holds for any constants $C_1$
and $C_2$.

We should mention that one can easily generate samples from a stable
distribution [8].

4.2 Random Projections

Conceptually, one can generate a matrix $R \in \mathbb{R}^{D \times k}$ and
multiply it with the data stream $A_t$, i.e., $X = R^T A_t \in \mathbb{R}^k$.
The resultant vector $X$ is only of length $k$. The entries of $R$, $r_{ij}$,
are i.i.d. samples of a stable distribution $S(\alpha, \beta = 1, 1)$.

By property of Fourier transforms, the entries of $X$, $x_j$, $j = 1$ to $k$,
are i.i.d. samples of a stable distribution

$$x_j = \left[R^T A_t\right]_j = \sum_{i=1}^{D} r_{ij} A_t[i] \sim S\left(\alpha, \beta = 1, F_{(\alpha)} = \sum_{i=1}^{D} A_t[i]^{\alpha}\right),$$

whose scale parameter $F_{(\alpha)}$ is exactly the $\alpha$th moment. Thus, CC
boils down to a statistical estimation problem.

For real implementations, one should conduct $R^T A_t$ incrementally. This is
possible because the Turnstile model (1) is a linear updating model. That is,
for every incoming $a_t = (i_t, I_t)$, we update
$x_j \leftarrow x_j + r_{i_t j} I_t$ for $j = 1$ to $k$. Entries of $R$ are
generated on-demand as necessary.

4.3 The Efficiency in Processing Time

[13] commented that, when $k$ is large, generating entries of $R$ on-demand and
multiplications $r_{i_t j} I_t$, $j = 1$ to $k$, can be prohibitive. An easy
"fix" is to use $k$ as small as possible, which is possible with CC when
$\alpha \approx 1$.

At the same $k$, all procedures of CC and symmetric stable random projections
are the same except the entries in $R$ follow different distributions. However,
since CC is much more accurate especially when $\alpha \approx 1$, it requires a
much smaller $k$ at the same level of accuracy.
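The incremental projection of Section 4.2 can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: `skewed_stable` uses the Chambers-Mallows-Stuck construction for unit scale, $\beta = 1$, and $\alpha \neq 1$, and the way `entry` regenerates $r_{ij}$ on demand from an integer seed is a hypothetical scheme chosen only to keep the example short.

```python
import math
import random

def skewed_stable(alpha, rng):
    """One draw from S(alpha, beta=1, 1), alpha != 1 (Chambers-Mallows-Stuck)."""
    V = rng.uniform(-math.pi / 2, math.pi / 2)
    W = rng.expovariate(1.0)
    t = math.tan(math.pi * alpha / 2)          # beta = 1
    B = math.atan(t) / alpha
    S = (1 + t * t) ** (1 / (2 * alpha))
    return (S * math.sin(alpha * (V + B)) / math.cos(V) ** (1 / alpha)
            * (math.cos(V - alpha * (V + B)) / W) ** ((1 - alpha) / alpha))

def entry(i, j, alpha, k, seed=0):
    """Regenerate r_ij on demand; the seeding scheme is an illustrative assumption."""
    return skewed_stable(alpha, random.Random(seed + i * k + j))

def update(x, i_t, I_t, alpha):
    """Process one element a_t = (i_t, I_t): x_j <- x_j + r_{i_t j} * I_t."""
    k = len(x)
    for j in range(k):
        x[j] += entry(i_t, j, alpha, k) * I_t

alpha, k = 0.9, 8
stream = [(3, 2), (5, 4), (3, -1), (9, 7)]     # hypothetical updates
x = [0.0] * k
for i_t, I_t in stream:
    update(x, i_t, I_t, alpha)

# The same projection computed at once from the final signal (static data).
A = {3: 1, 5: 4, 9: 7}
x_static = [sum(entry(i, j, alpha, k) * a for i, a in A.items()) for j in range(k)]
print(x)
print(x_static)
```

Because the update is linear, the incrementally maintained `x` agrees with `x_static` up to floating-point rounding, which is why static data suffice for studying estimation accuracy.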

4.4 Two Statistical Estimators for CC 

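CC boils down to estimating $F_{(\alpha)}$ from $k$ i.i.d. samples
$x_j \sim S(\alpha, \beta = 1, F_{(\alpha)})$. [24] provided two estimators.

4.4.1 The Unbiased Geometric Mean Estimator

$$\hat{F}_{(\alpha),gm} = \frac{\prod_{j=1}^{k} |x_j|^{\alpha/k}}{D_{gm}}, \qquad (8)$$

$$D_{gm} = \left(\cos^k\left(\frac{\kappa(\alpha)\pi}{2k}\right) \Big/ \cos\left(\frac{\kappa(\alpha)\pi}{2}\right)\right) \times \left[\frac{2}{\pi} \sin\left(\frac{\pi\alpha}{2k}\right) \Gamma\left(1 - \frac{1}{k}\right) \Gamma\left(\frac{\alpha}{k}\right)\right]^k, \qquad (9)$$

$$\kappa(\alpha) = \alpha \ \text{if} \ \alpha < 1, \qquad \kappa(\alpha) = 2 - \alpha \ \text{if} \ \alpha > 1.$$

The asymptotic (i.e., as $k \to \infty$) variance is

$$\mathrm{Var}\left(\hat{F}_{(\alpha),gm}\right) = \begin{cases} \frac{F_{(\alpha)}^2}{k} \frac{\pi^2}{6} \left(1 - \alpha^2\right) + O\left(\frac{1}{k^2}\right), & \alpha < 1 \\ \frac{F_{(\alpha)}^2}{k} \frac{\pi^2}{6} (\alpha - 1)(5 - \alpha) + O\left(\frac{1}{k^2}\right), & \alpha > 1 \end{cases} \qquad (10)$$

As $\alpha \to 1$, the asymptotic variance approaches zero.

4.4.2 The Harmonic Mean Estimator

$$\hat{F}_{(\alpha),hm} = \frac{k \frac{\cos\left(\frac{\alpha\pi}{2}\right)}{\Gamma(1+\alpha)}}{\sum_{j=1}^{k} |x_j|^{-\alpha}} \left(1 - \frac{1}{k} \left(\frac{2\Gamma^2(1+\alpha)}{\Gamma(1+2\alpha)} - 1\right)\right), \qquad (11)$$

which is asymptotically unbiased and has variance

$$\mathrm{Var}\left(\hat{F}_{(\alpha),hm}\right) = \frac{F_{(\alpha)}^2}{k} \left(\frac{2\Gamma^2(1+\alpha)}{\Gamma(1+2\alpha)} - 1\right) + O\left(\frac{1}{k^2}\right). \qquad (12)$$

$\hat{F}_{(\alpha),hm}$ is defined only for $\alpha < 1$ and is considerably more
accurate than the geometric mean estimator $\hat{F}_{(\alpha),gm}$.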
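The two estimators can be tried on simulated samples with a short Python sketch. It is only an illustration under stated assumptions: the samples are drawn with a Chambers-Mallows-Stuck sampler for $\alpha < 1$, and the values of `alpha`, `k`, `F_true`, and the random seed are arbitrary choices for this example.

```python
import math
import random

def skewed_stable(alpha, rng):
    """One draw from S(alpha, beta=1, 1), alpha != 1 (Chambers-Mallows-Stuck)."""
    V = rng.uniform(-math.pi / 2, math.pi / 2)
    W = rng.expovariate(1.0)
    t = math.tan(math.pi * alpha / 2)
    B = math.atan(t) / alpha
    S = (1 + t * t) ** (1 / (2 * alpha))
    return (S * math.sin(alpha * (V + B)) / math.cos(V) ** (1 / alpha)
            * (math.cos(V - alpha * (V + B)) / W) ** ((1 - alpha) / alpha))

def geometric_mean_estimate(x, alpha):
    """Geometric mean estimator with the normalizing constant D_gm."""
    k = len(x)
    kappa = alpha if alpha < 1 else 2 - alpha
    d_gm = (math.cos(kappa * math.pi / (2 * k)) ** k / math.cos(kappa * math.pi / 2)
            * (2 / math.pi * math.sin(math.pi * alpha / (2 * k))
               * math.gamma(1 - 1 / k) * math.gamma(alpha / k)) ** k)
    log_prod = alpha / k * sum(math.log(abs(v)) for v in x)
    return math.exp(log_prod) / d_gm

def harmonic_mean_estimate(x, alpha):
    """Harmonic mean estimator with its bias correction (alpha < 1 only)."""
    k = len(x)
    g = 2 * math.gamma(1 + alpha) ** 2 / math.gamma(1 + 2 * alpha) - 1
    lead = k * math.cos(alpha * math.pi / 2) / math.gamma(1 + alpha)
    return lead / sum(abs(v) ** -alpha for v in x) * (1 - g / k)

rng = random.Random(1)
alpha, k, F_true = 0.9, 1000, 5.0
scale = F_true ** (1 / alpha)          # samples from S(alpha, 1, F_true)
x = [scale * skewed_stable(alpha, rng) for _ in range(k)]
gm = geometric_mean_estimate(x, alpha)
hm = harmonic_mean_estimate(x, alpha)
print(gm, hm)
```

Both estimates should land close to `F_true`; repeating the experiment with different seeds gives an empirical view of the variances in (10) and (12).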

5 The Optimal Quantile Estimator 

The two estimators for CC in [24] dramatically reduce the estimation variances
compared to symmetric stable random projections. They are, however, not quite
adequate for estimating Shannon entropy using small ($k$) samples.

We discover that an estimator based on the sample quantiles considerably
improves [24] when $\alpha \approx 1$. Given $k$ i.i.d. samples
$x_j \sim S(\alpha, 1, F_{(\alpha)})$, we define the $q$-quantile to be the
$(q \times k)$th smallest of $|x_j|$. For example, when $k = 100$, the
$q = 0.01$th quantile is the smallest among the $|x_j|$'s.

To understand why the quantile works, consider the normal
$x_j \sim N(0, \sigma^2)$, which is a special case of stable distribution with
$\alpha = 2$. We can view $x_j = \sigma z_j$, where $z_j \sim N(0, 1)$.
Therefore, we can use the ratio of the $q$th quantile of $x_j$ over the $q$th
quantile of $N(0, 1)$ to estimate $\sigma$. Note that $F_{(\alpha)}$ corresponds
to $\sigma^2$, not $\sigma$.
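The normal example above can be reproduced in a few lines of Python. The sketch is illustrative only; the helper `quantile`, the choice $q = 0.75$, the sample size, and the seed are assumptions made for this example.

```python
import random
from statistics import NormalDist

def quantile(values, q):
    """The (q * k)-th smallest value (1-indexed) among k values."""
    ordered = sorted(values)
    index = max(1, round(q * len(ordered)))
    return ordered[index - 1]

rng = random.Random(0)
sigma, k, q = 3.0, 10000, 0.75
x = [rng.gauss(0.0, sigma) for _ in range(k)]

# Ratio of the sample quantile to the same quantile of N(0, 1) estimates sigma.
sigma_hat = quantile(x, q) / NormalDist().inv_cdf(q)
scale_hat = sigma_hat ** 2            # the analogue of F_(alpha) is sigma^2
print(sigma_hat, scale_hat)
```

The same idea carries over to the skewed stable case once the reference quantile of the unit-scale distribution is available, which is what the following subsections tabulate.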

5.1 The General Quantile Estimator 

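Assume $x_j \sim S(\alpha, 1, F_{(\alpha)})$, $j = 1$ to $k$. One can sort
$|x_j|$ and use the $(q \times k)$th smallest $|x_j|$ as the estimate, i.e.,

$$\hat{F}_{(\alpha),q} = \left(\frac{q\text{-Quantile}\{|x_j|, j = 1, 2, ..., k\}}{W_q}\right)^{\alpha},$$

$$W_q = q\text{-Quantile}\{|S(\alpha, \beta = 1, 1)|\}. \qquad (13)$$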

Denote $Z = |X|$, where $X \sim S(\alpha, 1, F_{(\alpha)})$. Denote the
probability density function of $Z$ by $f_Z(z; \alpha, F_{(\alpha)})$, the
cumulative distribution function by $F_Z(z; \alpha, F_{(\alpha)})$, and the
inverse by $F_Z^{-1}(q; \alpha, F_{(\alpha)})$. The asymptotic variance of
$\hat{F}_{(\alpha),q}$ is presented in Lemma 3, which follows directly from
known statistics results, e.g., [9, Theorem 9.2].




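Lemma 3

$$\mathrm{Var}\left(\hat{F}_{(\alpha),q}\right) = \frac{1}{k} \frac{(q - q^2)\,\alpha^2}{f_Z^2\left(F_Z^{-1}(q; \alpha, 1); \alpha, 1\right) \left(F_Z^{-1}(q; \alpha, 1)\right)^2} F_{(\alpha)}^2 + O\left(\frac{1}{k^2}\right). \qquad (14)$$

We can then choose $q = q^*$ to minimize the asymptotic variance.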



5.2 The Optimal Quantile Estimator 

We denote the optimal quantile estimator by
$\hat{F}_{(\alpha),oq} = \hat{F}_{(\alpha),q^*}$. The optimal quantiles, denoted
by $q^* = q^*(\alpha)$, have to be determined numerically and tabulated (as in
Table 2), because the density functions do not have an explicit closed form. We
used the fBasics package in R. We, however, found the implementation of those
functions had numerical problems when $1 < \alpha < 1.011$ and
$0.989 < \alpha < 1$. Table 2 provides the numerical values for $q^*$,
$W_{q^*}$ (13), and the variance of $\hat{F}_{(\alpha),oq}$ (without the
$\frac{1}{k}$ term).
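A minimal Python sketch of the quantile estimator is given below. It is an illustration under stated assumptions: the constants $q^* = 0.101$ and $W_{q^*} = 5.400842$ are the Table 2 entries for $\alpha = 0.90$, the samples are simulated with a Chambers-Mallows-Stuck sampler, and `k`, `F_true`, and the seed are arbitrary.

```python
import math
import random

def skewed_stable(alpha, rng):
    """One draw from S(alpha, beta=1, 1), alpha != 1 (Chambers-Mallows-Stuck)."""
    V = rng.uniform(-math.pi / 2, math.pi / 2)
    W = rng.expovariate(1.0)
    t = math.tan(math.pi * alpha / 2)
    B = math.atan(t) / alpha
    S = (1 + t * t) ** (1 / (2 * alpha))
    return (S * math.sin(alpha * (V + B)) / math.cos(V) ** (1 / alpha)
            * (math.cos(V - alpha * (V + B)) / W) ** ((1 - alpha) / alpha))

def quantile_estimate(x, alpha, q, w_q):
    """(q-quantile of |x_j| divided by W_q), raised to the power alpha."""
    ordered = sorted(abs(v) for v in x)
    index = max(1, round(q * len(ordered)))   # the (q * k)-th smallest
    return (ordered[index - 1] / w_q) ** alpha

# Constants for alpha = 0.90 taken from Table 2.
alpha, q_star, w_q_star = 0.90, 0.101, 5.400842

rng = random.Random(2)
k, F_true = 2000, 5.0
scale = F_true ** (1 / alpha)                 # samples from S(alpha, 1, F_true)
x = [scale * skewed_stable(alpha, rng) for _ in range(k)]
oq = quantile_estimate(x, alpha, q_star, w_q_star)
print(oq)
```

Only a sort (or a selection of one order statistic) is needed, so the estimator is also cheap to evaluate compared with the fractional powers in the geometric mean and harmonic mean estimators.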



Table 2: In order to use the optimal quantile estimator, we tabulate the
constants $q^*$ and $W_{q^*}$.

alpha    q*        Var             W_{q*}
0.90     0.101     0.04116676      5.400842
0.95     0.098     0.01059831      1.174773
0.96     0.097     0.006821834     14.92508
0.97     0.096     0.003859153     20.22440
0.98     0.0944    0.001724739     30.82616
0.989    0.0941    0.0005243589    56.86694
1.011    0.8904    0.0005554749    58.83961
1.02     0.8799    0.001901498     32.76892
1.03     0.869     0.004424189     22.13097
1.04     0.861     0.008099329     16.80970
1.05     0.855     0.01298757      13.61799
1.20     0.799     0.2516604       4.011459
5.3 Comparisons of Asymptotic Variances

Figure 3 (left panel) compares the variances of the three estimators for CC. To
better illustrate the improvements, Figure 3 (right panel) plots the ratios of
the variances. When $\alpha = 0.989$, the optimal quantile reduces the variances
by a factor of 70 (compared to the geometric mean estimator), or 20 (compared
to the harmonic mean estimator).

Figure 3: Let $\hat{F}$ be an estimator of $F$ with asymptotic variance
$\mathrm{Var}(\hat{F}) = V \frac{F^2}{k} + O\left(\frac{1}{k^2}\right)$. The
left panel plots the $V$ values for the geometric mean estimator, the harmonic
mean estimator (for $\alpha < 1$), and the optimal quantile estimator, along
with the $V$ values for the geometric mean estimator for symmetric stable random
projections in [23] ("symmetric GM"). The right panel plots the ratios of the
variances to better illustrate the significant improvement of the optimal
quantile estimator, near $\alpha = 1$.



6 Estimating Shannon Entropy, the Bias and Variance 

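This section analyzes the biases and variances in estimating Shannon entropy.
Also, we provide the criterion for choosing the sample size $k$.

We use $\hat{F}_{(\alpha)}$, $\hat{H}_\alpha$, and $\hat{T}_\alpha$ to denote
generic estimators:

$$\hat{H}_\alpha = \frac{1}{1-\alpha} \log \frac{\hat{F}_{(\alpha)}}{F_{(1)}^{\alpha}}, \qquad \hat{T}_\alpha = \frac{1}{\alpha-1} \left(1 - \frac{\hat{F}_{(\alpha)}}{F_{(1)}^{\alpha}}\right). \qquad (15)$$

Since $\hat{F}_{(\alpha)}$ is (asymptotically) unbiased, $\hat{H}_\alpha$ and
$\hat{T}_\alpha$ are also asymptotically unbiased. The asymptotic variances of
$\hat{H}_\alpha$ and $\hat{T}_\alpha$ can be computed by Taylor expansions:

$$\mathrm{Var}\left(\hat{H}_\alpha\right) = \frac{1}{(1-\alpha)^2} \mathrm{Var}\left(\log \hat{F}_{(\alpha)}\right) = \frac{1}{(1-\alpha)^2} \left(\frac{\partial \log F_{(\alpha)}}{\partial F_{(\alpha)}}\right)^2 \mathrm{Var}\left(\hat{F}_{(\alpha)}\right) + O\left(\frac{1}{k^2}\right)$$

$$= \frac{1}{(1-\alpha)^2} \frac{1}{F_{(\alpha)}^2} \mathrm{Var}\left(\hat{F}_{(\alpha)}\right) + O\left(\frac{1}{k^2}\right). \qquad (16)$$

$$\mathrm{Var}\left(\hat{T}_\alpha\right) = \frac{1}{(\alpha-1)^2} \frac{1}{F_{(1)}^{2\alpha}} \mathrm{Var}\left(\hat{F}_{(\alpha)}\right) + O\left(\frac{1}{k^2}\right). \qquad (17)$$

We use $\hat{H}_{\alpha,R}$ and $\hat{H}_{\alpha,T}$ to denote the estimators
for Shannon entropy using the estimated $\hat{H}_\alpha$ and $\hat{T}_\alpha$,
respectively. The variances remain unchanged, i.e.,

$$\mathrm{Var}\left(\hat{H}_{\alpha,R}\right) = \mathrm{Var}\left(\hat{H}_\alpha\right), \qquad \mathrm{Var}\left(\hat{H}_{\alpha,T}\right) = \mathrm{Var}\left(\hat{T}_\alpha\right). \qquad (18)$$

However, $\hat{H}_{\alpha,R}$ and $\hat{H}_{\alpha,T}$ are no longer
(asymptotically) unbiased, because

$$\mathrm{Bias}\left(\hat{H}_{\alpha,R}\right) = E\left(\hat{H}_{\alpha,R} - H\right) = H_\alpha - H + O\left(\frac{1}{k}\right), \qquad (19)$$

$$\mathrm{Bias}\left(\hat{H}_{\alpha,T}\right) = E\left(\hat{H}_{\alpha,T} - H\right) = T_\alpha - H + O\left(\frac{1}{k}\right). \qquad (20)$$

The $O\left(\frac{1}{k}\right)$ biases arise from the estimation biases and
diminish quickly as $k$ increases. However, the "intrinsic biases,"
$H_\alpha - H$ and $T_\alpha - H$, cannot be reduced by increasing $k$; they can
only be reduced by letting $\alpha$ be close to 1.

The total error is usually measured by the mean square error:
MSE = Bias$^2$ + Var. Clearly, there is a variance-bias trade-off in estimating
$H$ using $H_\alpha$ or $T_\alpha$. The optimal $\alpha$ is data-dependent and
hence some prior knowledge of the data is needed in order to determine it. The
prior knowledge may be accumulated during the data stream process.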
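The trade-off can be made concrete with a small Python sketch. It is a rough illustration only: it combines the intrinsic bias $H_\alpha - H$ with the leading term of the variance obtained from (16) and the geometric mean variance (10), ignores the $O\left(\frac{1}{k}\right)$ estimation bias, and uses a hypothetical Zipf distribution ($D = 1000$, $\gamma = 1$) with $k = 100$.

```python
import math

def zipf(D, gamma):
    w = [i ** -gamma for i in range(1, D + 1)]
    z = sum(w)
    return [x / z for x in w]

def shannon(p):
    return -sum(x * math.log(x) for x in p)

def renyi(p, alpha):
    return -math.log(sum(x ** alpha for x in p)) / (alpha - 1.0)

def mse_renyi_gm(p, alpha, k):
    """Intrinsic bias squared plus leading-order variance of the Renyi-based
    entropy estimate when the moment is estimated by the geometric mean."""
    bias = renyi(p, alpha) - shannon(p)
    if alpha < 1:
        v = math.pi ** 2 / 6 * (1 - alpha ** 2)
    else:
        v = math.pi ** 2 / 6 * (alpha - 1) * (5 - alpha)
    variance = v / (k * (1 - alpha) ** 2)
    return bias ** 2 + variance

p = zipf(1000, 1.0)
k = 100
grid = [0.50 + 0.01 * j for j in range(50)]      # alpha = 0.50, ..., 0.99
best_alpha = min(grid, key=lambda a: mse_renyi_gm(p, a, k))
print(best_alpha, mse_renyi_gm(p, best_alpha, k))
```

For this estimator the variance term grows as $\alpha \to 1$ while the bias term shrinks, so the minimizing $\alpha$ lies strictly inside the grid rather than at the value closest to 1.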



7 Experiments 

Experiments on real data (i.e., Table 1) further demonstrate the effectiveness
of Compressed Counting (CC) and the new optimal quantile estimator. We could
use static data to verify CC because we only care about the estimation
accuracy, which is the same regardless of whether the data are collected at one
time (static) or dynamically.

We present the results for estimating frequency moments and Shannon entropy, in
terms of the normalized MSEs. We observe that the results are quite similar
across different words; and hence only one word is selected for the
presentation.

7.1 Estimating Frequency Moments 

Figure 4 provides the normalized MSEs (by $F_{(\alpha)}^2$) for estimating the
$\alpha$th frequency moments, $F_{(\alpha)}$, for word RICE:

• The errors of the three estimators for CC decrease (to zero, potentially) as
$\alpha \to 1$. The improvement of CC over symmetric stable random projections
is enormous.

• The optimal quantile estimator $\hat{F}_{(\alpha),oq}$ is in general more
accurate than the geometric mean and harmonic mean estimators near
$\alpha = 1$. However, for small $k$ and $\alpha > 1$, $\hat{F}_{(\alpha),oq}$
exhibits bad behaviors, which disappear when $k > 50$.

• The theoretical asymptotic variances in (10), (12), and Table 2 are accurate.

Figure 4: Frequency moments, $F_{(\alpha)}$, for RICE. Solid curves are
empirical MSEs and dashed curves are theoretical asymptotic variances in (10),
(12), and Table 2. "gm" stands for the geometric mean estimator
$\hat{F}_{(\alpha),gm}$ (8), and "gm,sym" for the geometric mean estimator in
symmetric stable random projections [23].

7.2 Estimating Shannon Entropy

Figure 5 provides the MSEs from estimating the Shannon entropy using the Rényi
entropy, for word RICE:

• Using symmetric stable random projections with $\alpha$ close to 1 is not a
good strategy and not practically feasible because the required sample size is
enormous.

• There is clearly a variance-bias trade-off, especially for the geometric mean
and harmonic mean estimators. That is, for each $k$, there is an "optimal"
$\alpha$ which achieves the smallest MSE.

• Using the optimal quantile estimator does not show a strong variance-bias
trade-off, because it has very small variance near $\alpha = 1$ and its MSEs
are mainly dominated by the (intrinsic) biases, $H_\alpha - H$.

Figure 6 presents the MSEs for estimating Shannon entropy using Tsallis
entropy. The effect of the variance-bias trade-off for geometric mean and
harmonic mean estimators is even more significant, because the (intrinsic) bias
$T_\alpha - H$ is much larger.

8 Conclusion 

Web search data and network data are naturally data streams. The entropy is a
useful summary statistic and has numerous applications, e.g., network anomaly
detection. Efficiently and accurately computing the entropy in large and
frequently updating data streams, in one-pass, is an active topic of research.
A recent trend is to use the $\alpha$th frequency moments with
$\alpha \approx 1$ to approximate Shannon entropy. We conclude:

• We should use Rényi entropy to approximate Shannon entropy. Using Tsallis
entropy will result in about an order of magnitude larger bias in a common data
distribution.

• The optimal quantile estimator for CC reduces the variances by a factor of 20
or 70 when $\alpha = 0.989$, compared to the estimators in [24].

• When symmetric stable random projections must be used, we should exploit the
variance-bias trade-off, by not using $\alpha$ very close to 1.
Figure 5: Shannon entropy, $H$, estimated from Rényi entropy, $H_\alpha$, for
RICE.



A Proof of Lemma 1 

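$T_\alpha = \frac{1 - \sum_{i=1}^{D} p_i^{\alpha}}{\alpha - 1}$ and
$H_\alpha = \frac{-\log \sum_{i=1}^{D} p_i^{\alpha}}{\alpha - 1}$. Note that
$\sum_{i=1}^{D} p_i = 1$, $\sum_{i=1}^{D} p_i^{\alpha} \geq 1$ if $\alpha < 1$,
and $\sum_{i=1}^{D} p_i^{\alpha} \leq 1$ if $\alpha > 1$.

For $t > 0$, $-\log(t) \geq 1 - t$ always holds, with equality when $t = 1$.
Therefore, $H_\alpha \leq T_\alpha$ when $\alpha < 1$ and
$H_\alpha \geq T_\alpha$ when $\alpha > 1$. Also, we know
$\lim_{\alpha \to 1} T_\alpha = \lim_{\alpha \to 1} H_\alpha = H$. Therefore, to
show $|H_\alpha - H| \leq |T_\alpha - H|$, it suffices to show that both
$T_\alpha$ and $H_\alpha$ are decreasing functions of $\alpha \in (0, 2)$.

Taking the first derivatives of $T_\alpha$ and $H_\alpha$ yields

$$T'_\alpha = \frac{\sum_{i=1}^{D} p_i^{\alpha} - 1 - (\alpha - 1) \sum_{i=1}^{D} p_i^{\alpha} \log p_i}{(\alpha - 1)^2} = \frac{A_\alpha}{(\alpha - 1)^2},$$

$$H'_\alpha = \frac{\sum_{i=1}^{D} p_i^{\alpha} \log \sum_{i=1}^{D} p_i^{\alpha} - (\alpha - 1) \sum_{i=1}^{D} p_i^{\alpha} \log p_i}{(\alpha - 1)^2 \sum_{i=1}^{D} p_i^{\alpha}} = \frac{B_\alpha}{(\alpha - 1)^2}.$$

To show $T'_\alpha \leq 0$, it suffices to show that $A_\alpha \leq 0$. Taking
the derivative of $A_\alpha$ yields

$$A'_\alpha = -(\alpha - 1) \sum_{i=1}^{D} p_i^{\alpha} \log^2 p_i,$$

i.e., $A'_\alpha \geq 0$ if $\alpha < 1$ and $A'_\alpha \leq 0$ if
$\alpha > 1$. Because $A_1 = 0$, we know $A_\alpha \leq 0$. This proves
$T'_\alpha \leq 0$. To show $H'_\alpha \leq 0$, it suffices to show that
$B_\alpha \leq 0$, where

$$B_\alpha = \log \sum_{i=1}^{D} p_i^{\alpha} + \sum_{i=1}^{D} q_i \log p_i^{1-\alpha}, \quad \text{where} \ q_i = \frac{p_i^{\alpha}}{\sum_{i=1}^{D} p_i^{\alpha}}.$$

Note that $\sum_{i=1}^{D} q_i = 1$ and hence we can view $q_i$ as probabilities.
Since $\log()$ is a concave function, we can use Jensen's inequality:
$E \log(X) \leq \log E(X)$, to obtain

$$B_\alpha \leq \log \sum_{i=1}^{D} p_i^{\alpha} + \log \sum_{i=1}^{D} q_i p_i^{1-\alpha} = \log \sum_{i=1}^{D} p_i^{\alpha} + \log \frac{\sum_{i=1}^{D} p_i}{\sum_{i=1}^{D} p_i^{\alpha}} = 0.$$

Figure 6: Shannon entropy, $H$, estimated from Tsallis entropy, $T_\alpha$, for
RICE.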


B Proof of Lemma 2 

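As $\alpha \to 1$, using L'Hôpital's rule

$$\lim_{\alpha \to 1} T'_\alpha = \lim_{\alpha \to 1} \frac{\sum_{i=1}^{D} p_i^{\alpha} - 1 - (\alpha - 1) \sum_{i=1}^{D} p_i^{\alpha} \log p_i}{(\alpha - 1)^2} = \lim_{\alpha \to 1} \frac{-(\alpha - 1) \sum_{i=1}^{D} p_i^{\alpha} \log^2 p_i}{2(\alpha - 1)} = -\frac{1}{2} \sum_{i=1}^{D} p_i \log^2 p_i.$$

Note that, as $\alpha \to 1$, $\sum_{i=1}^{D} p_i^{\alpha} - 1 \to 0$ but
$\sum_{i=1}^{D} p_i^{\alpha} \log^n p_i \to \sum_{i=1}^{D} p_i \log^n p_i \neq 0$.

Taking the second derivative of
$T_\alpha = \frac{1 - \sum_{i=1}^{D} p_i^{\alpha}}{\alpha - 1}$ yields

$$T''_\alpha = \frac{2 - 2 \sum_{i=1}^{D} p_i^{\alpha} + 2(\alpha - 1) \sum_{i=1}^{D} p_i^{\alpha} \log p_i - (\alpha - 1)^2 \sum_{i=1}^{D} p_i^{\alpha} \log^2 p_i}{(\alpha - 1)^3};$$

using L'Hôpital's rule yields

$$\lim_{\alpha \to 1} T''_\alpha = \lim_{\alpha \to 1} \frac{-(\alpha - 1)^2 \sum_{i=1}^{D} p_i^{\alpha} \log^3 p_i}{3(\alpha - 1)^2} = -\frac{1}{3} \sum_{i=1}^{D} p_i \log^3 p_i,$$

where we skip the algebra which cancels some terms.

Again, applying L'Hôpital's rule yields the expressions for
$\lim_{\alpha \to 1} H'_\alpha$ and $\lim_{\alpha \to 1} H''_\alpha$:

$$\lim_{\alpha \to 1} H'_\alpha = \lim_{\alpha \to 1} \frac{\sum_{i=1}^{D} p_i^{\alpha} \log \sum_{i=1}^{D} p_i^{\alpha} - (\alpha - 1) \sum_{i=1}^{D} p_i^{\alpha} \log p_i}{(\alpha - 1)^2 \sum_{i=1}^{D} p_i^{\alpha}}$$

$$= \lim_{\alpha \to 1} \frac{\sum_{i=1}^{D} p_i^{\alpha} \log p_i \log \sum_{i=1}^{D} p_i^{\alpha} - (\alpha - 1) \sum_{i=1}^{D} p_i^{\alpha} \log^2 p_i}{2(\alpha - 1) \sum_{i=1}^{D} p_i^{\alpha} + (\alpha - 1)^2 \sum_{i=1}^{D} p_i^{\alpha} \log p_i}$$

$$= \lim_{\alpha \to 1} \frac{\left(\sum_{i=1}^{D} p_i^{\alpha} \log p_i\right)^2 / \sum_{i=1}^{D} p_i^{\alpha} - \sum_{i=1}^{D} p_i^{\alpha} \log^2 p_i + \text{negligible terms}}{2 \sum_{i=1}^{D} p_i^{\alpha} + \text{negligible terms}}$$

$$= \frac{1}{2} \left(\sum_{i=1}^{D} p_i \log p_i\right)^2 - \frac{1}{2} \sum_{i=1}^{D} p_i \log^2 p_i.$$

We skip the details for proving the limit of $H''_\alpha$, where

$$H''_\alpha = \frac{C_\alpha}{(\alpha - 1)^3 \left(\sum_{i=1}^{D} p_i^{\alpha}\right)^2}, \qquad (21)$$

$$C_\alpha = -(\alpha - 1)^2 \sum_{i=1}^{D} p_i^{\alpha} \log^2 p_i \sum_{i=1}^{D} p_i^{\alpha} - 2 \left(\sum_{i=1}^{D} p_i^{\alpha}\right)^2 \log \sum_{i=1}^{D} p_i^{\alpha} + (\alpha - 1)^2 \left(\sum_{i=1}^{D} p_i^{\alpha} \log p_i\right)^2 + 2(\alpha - 1) \sum_{i=1}^{D} p_i^{\alpha} \log p_i \sum_{i=1}^{D} p_i^{\alpha}.$$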